ddtrace/tracer: add integration tag to spans_started/finished #3023
base: main
Conversation
Datadog Report (branch report): ✅ 0 Failed, 5211 Passed, 72 Skipped, 2m 46.57s Total Time
Benchmarks: execution time 2025-01-21 22:23:34, comparing candidate commit 3e8b1df in the PR branch. Found 0 performance improvements and 5 performance regressions! Performance is the same for 53 metrics, with 1 unstable metric.
Regressed scenarios:
- BenchmarkInjectW3C-24
- BenchmarkSingleSpanRetention/no-rules-24
- BenchmarkSingleSpanRetention/with-rules/match-all-24
- BenchmarkSingleSpanRetention/with-rules/match-half-24
- BenchmarkTracerAddSpans-24
Ok, so you're reporting spansStarted/spansFinished on span.Start/span.Finish if the integration is not empty, and leaving the chunk reporting to any spans that are manual... I understand why you did this, but I'm not totally sure about the approach.
span.Start and span.Finish are typically called quite frequently, so if a majority of the spans come from automatic integrations, this will be very noisy (and it defeats the purpose of reporting the metrics at a specified interval, which is meant to reduce noise).
One alternative idea:
Change the way we track spansStarted and spansFinished to be some kind of counter map that includes the integration name, e.g. `map[string]uint32`, where the key is the integration name and the value is the count of spans started/finished for that integration. Then, in the metrics-reporting goroutine, we'd iterate over the map and report the spans started/finished per integration
(or some other idea I haven't thought of?)
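A minimal sketch of that counter-map idea, assuming a plain map guarded by a mutex for simplicity (the PR ultimately went with `xsync.MapOf` and atomic values); the `spanCounter` type and its method names are illustrative, not the actual code:

```go
package tracer

import "sync"

// spanCounter is a hypothetical per-integration counter.
type spanCounter struct {
	mu     sync.Mutex
	counts map[string]uint32 // integration name -> spans started (or finished)
}

func newSpanCounter() *spanCounter {
	return &spanCounter{counts: make(map[string]uint32)}
}

// inc records one span start/finish attributed to the given integration.
func (c *spanCounter) inc(integration string) {
	c.mu.Lock()
	c.counts[integration]++
	c.mu.Unlock()
}

// flush returns the per-integration counts and resets them; the metrics
// goroutine would iterate the result and emit one statsd count per integration.
func (c *spanCounter) flush() map[string]uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.counts
	c.counts = make(map[string]uint32)
	return out
}
```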
@mtoffl01 Good points! A map would probably work better; I was hesitant at first since I didn't want to change too much of what already exists, but knowing that these metrics are pretty old... I'm more down to change it up now.
@hannahkm I'm approving this but we should investigate why the benchmarks report the increased allocations.
Overall, I definitely have some concerns 🤔 Maybe you can write some additional tests to provide peace of mind:
- Tests designed to try to make the system fail: what happens when multiple goroutines call the start span / finish span methods? Can we prove that we've protected against a race condition? (A rough sketch of such a test follows after this list.)
- Maybe you want to write dedicated benchmarks to show how much performance is impacted.
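On the race-condition point, here is a hedged sketch of the kind of test that could be run under `go test -race`, using only the public tracer API; the operation name, loop counts, and the v1 import path are assumptions, and this is not the test added in the PR:

```go
package tracer_test

import (
	"sync"
	"testing"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// TestSpanMetricsConcurrency hammers StartSpan/Finish from many goroutines so
// that `go test -race` can flag any unsynchronized access to the span counters.
func TestSpanMetricsConcurrency(t *testing.T) {
	tracer.Start()
	defer tracer.Stop()

	var wg sync.WaitGroup
	for g := 0; g < 50; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				span := tracer.StartSpan("test.operation")
				span.Finish()
			}
		}()
	}
	wg.Wait()
}
```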
It LGTM, but I would recommend more complex tests for SpansStarted and SpansFinished: for example, generating multiple spans from different integrations and checking the metrics reported (rather than just one span).
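A hedged sketch of the shape such a test could take; `recordSpanStart` and the integration names are hypothetical stand-ins for however the real test would start spans from different contribs and read back the reported metrics:

```go
package tracer

import "testing"

// TestSpansStartedPerIntegration sketches the multi-integration assertion:
// start several spans attributed to different integrations, then check the
// per-integration counts rather than a single global total.
func TestSpansStartedPerIntegration(t *testing.T) {
	counts := make(map[string]int)
	recordSpanStart := func(integration string) { counts[integration]++ } // hypothetical hook

	for _, integration := range []string{"net/http", "chi", "manual", "chi", "manual"} {
		recordSpanStart(integration)
	}

	want := map[string]int{"net/http": 1, "chi": 2, "manual": 2}
	for integration, n := range want {
		if counts[integration] != n {
			t.Errorf("integration %q: got %d spans started, want %d", integration, counts[integration], n)
		}
	}
}
```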
Overall LGTM, but I think there is one small bug; see below. I might have some more feedback on the tests and benchmarks, but not enough time for it right now. Still, I think these comments should already be useful 🙇.
```diff
@@ -523,7 +520,6 @@ func (t *trace) finishedOne(s *span) {
 	}
 
 func (t *trace) finishChunk(tr *tracer, ch *chunk) {
-	atomic.AddUint32(&tr.spansFinished, uint32(len(ch.spans)))
```
NIT: Why did we relocate the trigger point for the span start/finish tracking from this location to the new locations? Would be great to capture the answer to this question in the PR description.
Good point! I'll add that. tl;dr: `finishChunk` reports a slice of spans at a time, which would require looping over this slice to get each span's `integration` value and report it as a tag. Rather than introduce this for loop, we move the logic into `span.finish()` instead.
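To make that trade-off concrete, a hedged illustration; `spanStub` and the `inc` callback are hypothetical stand-ins for the real span type and counter:

```go
package tracer

// spanStub is a hypothetical stand-in for the tracer's span type, keeping only
// the integration name relevant to this discussion.
type spanStub struct{ integration string }

// Counting in finishChunk would mean looping over every span in the flushed chunk.
func countChunk(chunk []*spanStub, inc func(integration string)) {
	for _, s := range chunk {
		inc(s.integration)
	}
}

// Counting in span.finish() only touches the single span that just finished.
func countOnFinish(s *spanStub, inc func(integration string)) {
	inc(s.integration)
}
```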
```diff
-	spansStarted, spansFinished, tracesDropped uint32
+	// These maps count the spans started and finished from
+	// each component, including contribs and "manual" spans.
+	spansStarted, spansFinished *xsync.MapOf[string, *atomic.Int64]
```
NIT: It would be nice to abstract the concept of a counting map into a dedicated type that lives in an internal package. We have a use for this in profiling as well.
However, I don't think this needs to be done in this PR. I can do it as a follow-up change for profiling.
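If that follow-up happens, one possible shape for a shared counting type, sketched here with a mutex-guarded map under a hypothetical internal package; none of these names come from the actual codebase:

```go
// Package counter is a hypothetical internal package sketching a reusable
// counting map, as suggested above for sharing between tracer and profiler.
package counter

import "sync"

// Map counts occurrences per key.
type Map[K comparable] struct {
	mu sync.Mutex
	m  map[K]int64
}

func New[K comparable]() *Map[K] {
	return &Map[K]{m: make(map[K]int64)}
}

// Add increments the count for key by n.
func (c *Map[K]) Add(key K, n int64) {
	c.mu.Lock()
	c.m[key] += n
	c.mu.Unlock()
}

// Drain returns the current counts and resets the map.
func (c *Map[K]) Drain() map[K]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.m
	c.m = make(map[K]int64)
	return out
}
```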
ddtrace/tracer/metrics.go (outdated)
```go
		value.Swap(0)
		if err != nil {
			log.Debug("Error while reporting spans started from integration %s: %s", key, err.Error())
		}
```
Why are we logging the error here? I mean, in general it's a good idea, but it wasn't being done in the old code.
What kind of errors could happen here? Could we end up flooding the debug log? Are we concerned about locking the xmaps for too long while doing the logging?
Generally, the idea was so that we can track down a reason for potentially missing span metrics. Though looking at the code again, it doesn't seem that `t.statsd.Count()` ever actually returns an error (see the function it calls, `addMetric()`). So the easy answer is that perhaps we are not worried at all about flooding the log or locking for too long.
It seems that the previously existing calls to this function did not check for `err`. For the time being, I will also remove the check here to match those instances.
```diff
@@ -48,7 +52,7 @@ func TestReportHealthMetricsAtInterval(t *testing.T) {
 	var tg statsdtest.TestStatsdClient
 
 	defer func(old time.Duration) { statsInterval = old }(statsInterval)
-	statsInterval = time.Nanosecond
+	statsInterval = time.Millisecond
```
753232f mentions this was done to fix flakiness. Was this flakiness pre-existing, or introduced by this PR? How did you verify that this fixes the problem?
[The new commit is at bb7be55; the rebase accidentally moved some things around.]
The flake was introduced with this PR. It was caused by issues in `tg.Wait()`, which waits for `n` (in this case 4) reported metrics. But since we now report the number of spans started/finished even when the value is 0, we report at least two metrics for every `statsInterval`, which is why I decided to increase it here.
I verified it by simply running the test 1000+ times locally. It didn't fail at all, which made me believe it was no longer flaking.
What does this PR do?
Add an `integration` tag to the existing `datadog.tracer.spans_started` and `datadog.tracer.spans_finished` metrics. The value of the tag will be the name of the component from which the span was started. For example, for a contrib, it will be the name of the contrib package (`chi`, `net/http`, etc.). For spans that were created manually, the tag will say `manual`.
For the purpose of adding tags, we move the logic of counting finished spans to `span.finish()` rather than `trace.finishChunk()`. Since we must read the `integration` data from each individual span, we would rather increment our counter each time one span finishes. Counting in `finishChunk()` as we did previously would require a for loop, which might impact efficiency.
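As a rough illustration of the reporting side, a hedged sketch of emitting per-integration counts with the `integration` tag; `statsdCounter`, `reportSpanCounts`, and `perIntegration` are illustrative names, not the exact code in this PR:

```go
package tracer

// statsdCounter captures just the Count call used here; the real code goes
// through the tracer's internal statsd client.
type statsdCounter interface {
	Count(name string, value int64, tags []string, rate float64) error
}

// reportSpanCounts emits one count per integration, e.g.
// datadog.tracer.spans_started with the tag integration:net/http.
func reportSpanCounts(client statsdCounter, metric string, perIntegration map[string]int64) {
	for integration, n := range perIntegration {
		_ = client.Count(metric, n, []string{"integration:" + integration}, 1)
	}
}
```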
Motivation
We want to know, in addition to when a span is started, where the span originated from. This could be a contrib or a manual implementation.
Reviewer's Checklist
v2-dev branch and reviewed by @DataDog/apm-go.
Unsure? Have a question? Request a review!